Amazon Alexa Review - Sentiment Analysis¶
Amazon Alexa is a popular voice-controlled virtual assistant developed by Amazon, and its associated devices like Echo speakers have garnered a significant number of reviews on Amazon's website. The goal of this project is to extract these reviews, apply sentiment analysis using machine learning and NLP to understand the sentiments expressed by customers, and potentially gain insights into customer satisfaction or dissatisfaction with Alexa devices.
Data¶
The original data came from: https://www.kaggle.com/datasets/sid321axn/amazon-alexa-reviews
Key features¶
- Rating = rating value (between 1 and 5) given by the user
- Date = the date the review was posted
- Variation = type of amazon alexa product
- Verified reviews = the textual review given by the user for a variation of the product
- Feedback = feedback of the verified review.
- 0 - negative feedback
- 1 - positive feedback
Preparing the tools¶
# Import all the tools we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from IPython.display import SVG, display
import nltk
nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
# We want our plots to appear inside the notebook
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Machine learning
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
import re
import plotly.io as pio
pio.renderers.default = 'notebook'
# Save the model
import joblib
[nltk_data] Downloading package stopwords to /usr/local/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Load the data¶
df = pd.read_csv("amazon_alexa.tsv", delimiter='\t', quoting=3)  # quoting=3 -> csv.QUOTE_NONE
print(f"Dataset shape: {df.shape}")
df
Dataset shape: (3150, 5)
| rating | date | variation | verified_reviews | feedback | |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | "Sometimes while playing a game, you can answe... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | "I have had a lot of fun with this thing. My 4... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |
| ... | ... | ... | ... | ... | ... |
| 3145 | 5 | 30-Jul-18 | Black Dot | "Perfect for kids, adults and everyone in betw... | 1 |
| 3146 | 5 | 30-Jul-18 | Black Dot | "Listening to music, searching locations, chec... | 1 |
| 3147 | 5 | 30-Jul-18 | Black Dot | "I do love these things, i have them running m... | 1 |
| 3148 | 5 | 30-Jul-18 | White Dot | "Only complaint I have is that the sound quali... | 1 |
| 3149 | 4 | 29-Jul-18 | Black Dot | Good | 1 |
3150 rows × 5 columns
Exploratory Data Analysis¶
In this phase, our focus will be on exploring the dataset. We'll carefully examine the data, searching for any gaps or missing information. We'll then proceed to analyze the data statistically, aiming to uncover patterns and trends within it. Additionally, we'll utilize visualizations to present the data in a more understandable format. This approach will enable us to gain insights into the structure of the data, identify any notable patterns, and gain a better understanding of the relationships between different variables.
df.head()
| rating | date | variation | verified_reviews | feedback | |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | "Sometimes while playing a game, you can answe... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | "I have had a lot of fun with this thing. My 4... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |
df.tail()
| rating | date | variation | verified_reviews | feedback | |
|---|---|---|---|---|---|
| 3145 | 5 | 30-Jul-18 | Black Dot | "Perfect for kids, adults and everyone in betw... | 1 |
| 3146 | 5 | 30-Jul-18 | Black Dot | "Listening to music, searching locations, chec... | 1 |
| 3147 | 5 | 30-Jul-18 | Black Dot | "I do love these things, i have them running m... | 1 |
| 3148 | 5 | 30-Jul-18 | White Dot | "Only complaint I have is that the sound quali... | 1 |
| 3149 | 4 | 29-Jul-18 | Black Dot | Good | 1 |
# Display the data types of each column
print(df.dtypes)
rating               int64
date                object
variation           object
verified_reviews    object
feedback             int64
dtype: object
- rating and feedback are integer values
- date, variation and verified_reviews are string values
# Statistical analysis
df.describe()
| rating | feedback | |
|---|---|---|
| count | 3150.000000 | 3150.000000 |
| mean | 4.463175 | 0.918413 |
| std | 1.068506 | 0.273778 |
| min | 1.000000 | 0.000000 |
| 25% | 4.000000 | 1.000000 |
| 50% | 5.000000 | 1.000000 |
| 75% | 5.000000 | 1.000000 |
| max | 5.000000 | 1.000000 |
# Column names
print(f"Feature names : {df.columns.values}")
Feature names : ['rating' 'date' 'variation' 'verified_reviews' 'feedback']
# Check for missing values
missing_values = df.isnull().sum()
print(f"Missing values in each column:\n{missing_values}")
Missing values in each column:
rating              0
date                0
variation           0
verified_reviews    1
feedback            0
dtype: int64
# Getting the record where 'verified_reviews' is null
df[df['verified_reviews'].isna()]
| rating | date | variation | verified_reviews | feedback | |
|---|---|---|---|---|---|
| 473 | 2 | 29-Jun-18 | White | NaN | 0 |
# Drop the row with the null value since it's only one record
df.dropna(inplace=True)
print(f"Dataset shape after dropping null values: {df.shape}")
Dataset shape after dropping null values: (3149, 5)
# Creating a new column 'length' that contains the length of the string in the 'verified_reviews' column
df['length'] = df['verified_reviews'].apply(len)
# Check the new column 'Length'
df.head()
| rating | date | variation | verified_reviews | feedback | length | |
|---|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 | 13 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 | 9 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | "Sometimes while playing a game, you can answe... | 1 | 197 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | "I have had a lot of fun with this thing. My 4... | 1 | 174 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 | 5 |
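Note that `apply(len)` counts characters, not words — spaces and punctuation included. A quick sanity check on the first review above:

```python
# 'length' counts characters, so "Love my Echo!" has length 13
review = "Love my Echo!"
print(len(review))          # 13 characters, including spaces and '!'
print(len(review.split()))  # 3 words, for comparison
```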
Rating column analysis¶
# Let's find the distinct values of 'rating' and their counts
print(f"Rating value count:\n{df['rating'].value_counts()}")
Rating value count:
rating
5    2286
4     455
1     161
3     152
2      95
Name: count, dtype: int64
# Let's create a plot
fig = px.histogram(df, x='rating', title='Rating Distribution Count',
labels={'rating': 'Ratings', 'count': 'Count'},
color_discrete_sequence=['coral'])
fig.show()
# Let's find the percentage distribution of each rating - we'll divide the number of records for each rating by the total number of records
print(f"Rating value count - percentage distribution:\n{round(df['rating'].value_counts()/df.shape[0]*100,2)}")
Rating value count - percentage distribution:
rating
5    72.59
4    14.45
1     5.11
3     4.83
2     3.02
Name: count, dtype: float64
# Let's create a plot
rating_counts = df['rating'].value_counts()
percentage_distribution = round(rating_counts / df.shape[0] * 100, 2)
fig = px.pie(names=percentage_distribution.index, values=percentage_distribution.values,
title='Rating Distribution (%)')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
- Both graphs reveal a pronounced skew toward higher ratings, with the majority of users assigning a rating of 5. This dominance of the highest rating suggests that a large portion of users are highly satisfied with the product.
Feedback column analysis¶
# Distinct values of 'feedback' and its count
print(f"Feedback value count:\n{df['feedback'].value_counts()}")
Feedback value count:
feedback
1    2893
0     256
Name: count, dtype: int64
We have 2 different values of 'feedback', 0 and 1.
Feedback value 0¶
# Extracting the 'verified_reviews' value for one record with feedback = 0
review_0 = df[df['feedback'] == 0].iloc[1]['verified_reviews']
print(review_0)
Sound is terrible if u want good music too get a bose
# Extracting the 'verified_reviews' value for one record with feedback = 1
review_1 = df[df['feedback'] == 1].iloc[1]['verified_reviews']
print(review_1)
Loved it!
Based on the information above, we can see that feedback 0 is a negative review and 1 is a positive review
# Let's create a plot to visualize the total counts of each feedback
data = {
'feedback': df['feedback'].astype('category')
}
df_feedback = pd.DataFrame(data)
feedback_counts = df_feedback['feedback'].value_counts().reset_index()
feedback_counts.columns = ['Feedback', 'Count']
fig = px.bar(feedback_counts, x='Feedback', y='Count',
title='Feedback Distribution Count',
labels={'Feedback': 'Feedback', 'Count': 'Count'},
color='Feedback', color_discrete_sequence=['orange', 'limegreen'])
fig.update_layout(xaxis={'type': 'category', 'categoryarray': ['0', '1']})
fig.show()
# Let's find the percentage distribution of each feedback - we'll divide the number of records for each feedback by the total number of records
print(f"Feedback value count - percentage distribution: \n{round(df['feedback'].value_counts()/df.shape[0]*100,2)}")
Feedback value count - percentage distribution: 
feedback
1    91.87
0     8.13
Name: count, dtype: float64
# Let's create a plot
data = {
'feedback': [1, 0],
'percentage': [91.87, 8.13]
}
df_percentage = pd.DataFrame(data)
fig = px.pie(df_percentage, values='percentage', names='feedback',
title='Feedback Distribution (%)',
hole=0.4,
color_discrete_sequence=["gold", "tomato"])
fig.show()
Feedback distribution:
- 2893 reviews (91.87% of the total) are positive
- 256 reviews (8.13% of the total) are negative
Overall, users are satisfied with the product, as we also saw in the rating graphs.
Rating values vs Feedback values¶
# Feedback = 0
df[df['feedback'] == 0]['rating'].value_counts()
rating
1    161
2     95
Name: count, dtype: int64
# Feedback = 1
df[df['feedback'] == 1]['rating'].value_counts()
rating
5    2286
4     455
3     152
Name: count, dtype: int64
If the rating of a review is 1 or 2, the feedback is 0 (negative); if the rating is 3, 4 or 5, the feedback is 1 (positive).
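This rating-to-feedback mapping can be verified in one line with a cross-tabulation; a sketch on a toy frame that mirrors the observed pattern:

```python
import pandas as pd

# Toy frame mirroring the observed mapping: ratings 1-2 -> feedback 0, ratings 3-5 -> feedback 1
toy = pd.DataFrame({'rating':   [1, 2, 3, 4, 5, 5],
                    'feedback': [0, 0, 1, 1, 1, 1]})
ct = pd.crosstab(toy['rating'], toy['feedback'])
print(ct)  # each rating row has a nonzero count in exactly one feedback column
```

Running `pd.crosstab(df['rating'], df['feedback'])` on the real data confirms the same thing: no rating value maps to both feedback classes.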
Variation column analysis¶
# Let's find the distinct values of 'variation' and their counts
print(f"Variation value count: \n{df['variation'].value_counts()}")
Variation value count: 
variation
Black Dot                       516
Charcoal Fabric                 430
Configuration: Fire TV Stick    350
Black Plus                      270
Black Show                      265
Black                           261
Black Spot                      241
White Dot                       184
Heather Gray Fabric             157
White Spot                      109
Sandstone Fabric                 90
White                            90
White Show                       85
White Plus                       78
Oak Finish                       14
Walnut Finish                     9
Name: count, dtype: int64
# Let's create a plot
variation_counts = df['variation'].value_counts().reset_index()
variation_counts.columns = ['Variation', 'Count']
fig = px.scatter(variation_counts, x='Variation', y='Count', size='Count',
title='Variation Distribution Counts',
labels={'Variation': 'Variation', 'Count': 'Count'},
hover_name='Variation', size_max=30,
color='Count', color_continuous_scale='Viridis',
template='plotly_white')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.update_layout(xaxis_tickangle=-45)
fig.show()
# Finding the percentage distribution of each variation - we'll divide the number of records for each variation by the total number of records
print(f"Variation value count - percentage distribution: \n{round(df['variation'].value_counts()/df.shape[0]*100,2)}")
Variation value count - percentage distribution: 
variation
Black Dot                       16.39
Charcoal Fabric                 13.66
Configuration: Fire TV Stick    11.11
Black Plus                       8.57
Black Show                       8.42
Black                            8.29
Black Spot                       7.65
White Dot                        5.84
Heather Gray Fabric              4.99
White Spot                       3.46
Sandstone Fabric                 2.86
White                            2.86
White Show                       2.70
White Plus                       2.48
Oak Finish                       0.44
Walnut Finish                    0.29
Name: count, dtype: float64
# Let's create a plot
variation_counts = df['variation'].value_counts()
percentage_distribution = round(variation_counts / df.shape[0] * 100, 2)
df_percentage = pd.DataFrame({'Variation': percentage_distribution.index,
'Percentage': percentage_distribution.values})
fig = px.bar(df_percentage, x='Variation', y='Percentage',
title='Variation Distribution (%)', color="Percentage")
fig.update_layout(xaxis_title='Variation', yaxis_title='Percentage (%)',xaxis_tickangle=-45, barmode='stack')
fig.show()
- We can see that users prefer the Black Dot, and their least favorite is the Walnut Finish. This can be due to different reasons, such as differences in price and affordability, availability, and market trends.
Verified reviews column analysis¶
We are going to use the length column since it's the numeric representation of the verified_reviews column
df['length'].describe()
count    3149.000000
mean      132.714513
std       182.541531
min         1.000000
25%        30.000000
50%        74.000000
75%       166.000000
max      2853.000000
Name: length, dtype: float64
Length analysis when feedback is 0 - negative¶
# Let's create a plot
filtered_df = df[df['feedback'] == 0]
length_counts = filtered_df['length'].value_counts().sort_index()
df_length_counts = pd.DataFrame({'Review Length': length_counts.index,
'Count': length_counts.values})
fig = px.histogram(df_length_counts, x='Review Length', y='Count',
title='Distribution of Review Length when Feedback is 0',
labels={'Review Length': 'Review Length', 'Count': 'Count'},
template='plotly_white',
color_discrete_sequence=['orchid'])
fig.update_layout(xaxis_title='Review Length', yaxis_title='Count')
fig.show()
Length analysis when feedback is 1 - positive¶
# Let's create a plot
filtered_df = df[df['feedback'] == 1]
length_counts = filtered_df['length'].value_counts().sort_index()
df_length_counts = pd.DataFrame({'Review Length': length_counts.index,
'Count': length_counts.values})
fig = px.histogram(df_length_counts, x='Review Length', y='Count',
title='Distribution of Review Length when Feedback is 1',
labels={'Review Length': 'Review Length', 'Count': 'Count'},
template='plotly_white',
color_discrete_sequence=['orange'])
fig.update_layout(xaxis_title='Review Length', yaxis_title='Count')
fig.show()
In both cases, negative feedback (0) and positive feedback (1), the review lengths are on the shorter side, meaning that users don't tend to write long reviews.
Lengthwise mean rating¶
# Let's create a plot
mean_ratings_by_length = df.groupby('length')['rating'].mean().reset_index()
fig = px.scatter(mean_ratings_by_length, x='length', y='rating',
title='Review Lengthwise Mean Ratings',
labels={'length': 'Review Length', 'rating': 'Mean Rating'},
template='plotly_white',
color_discrete_sequence=['darkturquoise'])
fig.update_layout(xaxis_title='Review Length', yaxis_title='Mean Rating')
fig.show()
The mean length is higher for positive feedback; customers who are satisfied with a product often want to share their positive experiences with others, and may include specific details, anecdotes, or use cases to highlight why they enjoyed it.
# We are going to use CountVectorizer to take the textual data and convert it into vector representations
cv = CountVectorizer(stop_words='english')
words = cv.fit_transform(df.verified_reviews)
# We want to combine all reviews
reviews = " ".join([review for review in df['verified_reviews']])
# Initialize wordcloud object
wc = WordCloud(background_color='white', max_words=80)
# Generate and plot wordcloud
plt.figure(figsize=(10, 10))
plt.imshow(wc.generate(reviews))
plt.title('Wordcloud for all reviews', fontsize=15)
plt.axis('off');
Dividing the data into positive and negative feedback¶
# Combine all reviews for each feedback category and split them into individual words
neg_reviews = " ".join([review for review in df[df['feedback']==0]['verified_reviews']])
neg_reviews = neg_reviews.lower().split()
pos_reviews = " ".join([review for review in df[df['feedback']==1]['verified_reviews']])
pos_reviews = pos_reviews.lower().split()
# Finding words from reviews present in that feedback category only
unique_negative = [x for x in neg_reviews if x not in pos_reviews]
unique_negative = " ".join(unique_negative)
unique_positive = [x for x in pos_reviews if x not in neg_reviews]
unique_positive = " ".join(unique_positive)
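The list comprehensions above test membership against a list, which is O(n·m); converting to sets computes the same unique words in O(n + m). A sketch on toy data:

```python
# Set difference gives the words unique to each feedback class in O(n + m),
# versus the O(n * m) list-membership scans above
neg_words = "sound is terrible returned it".split()
pos_words = "sound is great love it".split()

unique_negative = set(neg_words) - set(pos_words)
unique_positive = set(pos_words) - set(neg_words)
print(unique_negative)  # {'terrible', 'returned'} (set order may vary)
print(unique_positive)  # {'great', 'love'}
```

Note that sets drop repeat counts; to preserve word frequencies for the wordcloud, filter the list against a set instead, e.g. `[w for w in neg_reviews if w not in set(pos_reviews)]`.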
Negative feedback¶
wc = WordCloud(background_color = 'white', max_words=80)
# Generate and plot wordcloud
plt.figure(figsize=(10,10))
plt.imshow(wc.generate(unique_negative))
plt.title('Wordcloud for negative reviews', fontsize=15)
plt.axis('off');
Positive feedback¶
wc = WordCloud(background_color = 'white', max_words=80)
# Generate and plot wordcloud
plt.figure(figsize=(10,10))
plt.imshow(wc.generate(unique_positive))
plt.title('Wordcloud for positive reviews', fontsize=15)
plt.axis('off');
Preprocessing¶
To build the corpus from the verified_reviews we are going to do the following:
- Replace any non-alphabetic characters with a space
- Convert to lower case and split into words
- Iterate over the individual words; if a word is not a stopword, add its stemmed form to the corpus
corpus = []
stemmer = PorterStemmer()
for i in range(0, df.shape[0]):
review = re.sub('[^a-zA-Z]', ' ', df.iloc[i]['verified_reviews'])
review = review.lower().split()
review = [stemmer.stem(word) for word in review if not word in STOPWORDS]
review = ' '.join(review)
corpus.append(review)
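The first two steps can be illustrated on a single raw review. This sketch uses a tiny inline stopword set as a stand-in for NLTK's STOPWORDS, and omits the PorterStemmer step that the loop above applies:

```python
import re

stop = {'a', 'while', 'you', 'can'}  # tiny stand-in for the NLTK STOPWORDS set

raw = "Sometimes while playing a game, you can answer questions!"
cleaned = re.sub('[^a-zA-Z]', ' ', raw)   # non-letters become spaces
tokens = [w for w in cleaned.lower().split() if w not in stop]
print(tokens)  # ['sometimes', 'playing', 'game', 'answer', 'questions']
```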
Using Count Vectorizer to create bag of words:
- We do this to convert textual data into numerical representations since this data will be used to feed our machine learning models
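Conceptually, CountVectorizer builds a vocabulary from the corpus and counts how often each term appears in each document. A standard-library sketch of the same bag-of-words idea on two toy documents:

```python
from collections import Counter

# Bag of words: one column per vocabulary term, one row per document
docs = ["love my echo", "love love it"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)    # ['echo', 'it', 'love', 'my']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 2, 0]]
```

`max_features=2500` in the real call below keeps only the 2500 most frequent terms, capping the dimensionality of X.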
cv = CountVectorizer(max_features= 2500)
# Storing independent and dependent variables in X and y
X = cv.fit_transform(corpus).toarray()
y = df['feedback'].values
# Saving the Count Vector
joblib.dump(cv, 'countvectorizer.pkl')
# Check the shape of X and y
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
X shape: (3149, 2500)
y shape: (3149,)
Train-Test split¶
# Splitting the data into train and test sets, with 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state=15)
print(f"X train: {X_train.shape}")
print(f"y train: {y_train.shape}")
print(f"X test: {X_test.shape}")
print(f"y test: {y_test.shape}")
X train: (2204, 2500)
y train: (2204,)
X test: (945, 2500)
y test: (945,)
Standardization¶
- We scale the data so that all features share a common range and contribute comparably to the model.
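MinMaxScaler rescales each feature with x' = (x − min) / (max − min), where min and max are learned from the training data only — which is why `fit_transform` is used on the train set and plain `transform` on the test set below. A stdlib sketch on one feature column:

```python
# Min-max scaling: x' = (x - min) / (max - min), fitted on training data only
train = [0, 3, 12]                  # e.g. one word-count feature
t_min, t_max = min(train), max(train)
scale = lambda x: (x - t_min) / (t_max - t_min)

print([scale(x) for x in train])    # [0.0, 0.25, 1.0]
# Test values reuse the *training* min/max, so they can land outside [0, 1]
print([scale(x) for x in [6, 15]])  # [0.5, 1.25]
```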
print(f"X train max value: {X_train.max()}")
print(f"X test max value: {X_test.max()}")
X train max value: 12
X test max value: 10
# We are going to scale X_train and X_test so all the values are between 0 and 1
scaler = MinMaxScaler()
X_train_scl = scaler.fit_transform(X_train)
X_test_scl = scaler.transform(X_test)
# Save the scaler model
joblib.dump(scaler, 'scaler.pkl')
Modeling¶
Random Forest¶
# Fitting scaled X_train and y_train on RandomForestClassifier
model_rf = RandomForestClassifier()
model_rf.fit(X_train_scl, y_train)
RandomForestClassifier()
# Accuracy of the model on training and testing data
print("Training Accuracy:", model_rf.score(X_train_scl, y_train))
print("Testing Accuracy:", model_rf.score(X_test_scl, y_test))
Training Accuracy: 0.9945553539019963
Testing Accuracy: 0.9470899470899471
# Predicting on the test set
y_preds = model_rf.predict(X_test_scl)
# Confusion Matrix
cm = confusion_matrix(y_test, y_preds)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=model_rf.classes_)
cm_display.plot();
K fold cross-validation¶
accuracies = cross_val_score(estimator = model_rf, X = X_train_scl, y = y_train, cv = 10)
print("Accuracy:", accuracies.mean())
print("Standard Deviation:", accuracies.std())
Accuracy: 0.9324002468120115
Standard Deviation: 0.008689794785676577
We are going to apply GridSearchCV to get the optimal parameters for the RandomForest model
params = {
    'bootstrap': [True],
    'max_depth': [80, 100],
    'min_samples_split': [8, 12],
    'n_estimators': [100]
}
cv_object = StratifiedKFold(n_splits = 2)
grid_search = GridSearchCV(estimator = model_rf, param_grid= params, cv = cv_object, verbose= 0, return_train_score= True)
grid_search.fit(X_train_scl, y_train.ravel())
GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=RandomForestClassifier(),
             param_grid={'bootstrap': [True], 'max_depth': [80, 100],
                         'min_samples_split': [8, 12],
                         'n_estimators': [100]},
             return_train_score=True)
# Getting the best parameters from the grid search
print("Best Parameter Combination : {}".format(grid_search.best_params_))
Best Parameter Combination : {'bootstrap': True, 'max_depth': 100, 'min_samples_split': 12, 'n_estimators': 100}
print("Cross validation mean accuracy on train set : {}".format(grid_search.cv_results_['mean_train_score'].mean()*100))
print("Cross validation mean accuracy on test set : {}".format(grid_search.cv_results_['mean_test_score'].mean()*100))
print("Accuracy score for test set :", accuracy_score(y_test, y_preds))
Cross validation mean accuracy on train set : 96.80127041742288
Cross validation mean accuracy on test set : 92.18466424682396
Accuracy score for test set : 0.9470899470899471
XGBClassifier¶
xgb_model = XGBClassifier()
xgb_model.fit(X_train_scl, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
# Accuracy of the model on training and testing data
print("Training Accuracy:", xgb_model.score(X_train_scl, y_train))
print("Testing Accuracy:", xgb_model.score(X_test_scl, y_test))
Training Accuracy: 0.971415607985481
Testing Accuracy: 0.9417989417989417
y_preds = xgb_model.predict(X_test_scl)
# Confusion Matrix
cm = confusion_matrix(y_test, y_preds)
print(cm)
[[ 31  47]
 [ 13 854]]
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=xgb_model.classes_)
cm_display.plot();
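With about 92% of reviews positive, accuracy alone is flattering; per-class precision and recall computed from the confusion matrix printed above are more informative (rows are true classes, columns are predicted classes):

```python
# Confusion matrix for XGBoost from the output above: cm[i][j] = true class i, predicted j
cm = [[31, 47],    # class 0 (negative): 31 correctly flagged, 47 misclassified as positive
      [13, 854]]   # class 1 (positive)

recall_neg = cm[0][0] / (cm[0][0] + cm[0][1])     # share of real negatives detected
precision_neg = cm[0][0] / (cm[0][0] + cm[1][0])  # share of predicted negatives that are real
print(round(recall_neg, 3))     # 0.397
print(round(precision_neg, 3))  # 0.705
```

Despite the ~94% headline accuracy, the model misses roughly 60% of the negative reviews — the class imbalance dominates the accuracy number.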
Decision Tree Classifier¶
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scl, y_train)
DecisionTreeClassifier()
# Accuracy of the model on training and testing data
print("Training Accuracy:", dt_model.score(X_train_scl, y_train))
print("Testing Accuracy:", dt_model.score(X_test_scl, y_test))
Training Accuracy: 0.9945553539019963 Testing Accuracy: 0.9238095238095239
y_preds = dt_model.predict(X_test_scl)
# Confusion matrix
cm = confusion_matrix(y_test, y_preds)
print(cm)
[[ 44  34]
 [ 99 768]]
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dt_model.classes_)
cm_display.plot();
Model Comparison¶
model_scores = {
    'Model': ['RandomForestClassifier', 'XGBClassifier', 'DecisionTreeClassifier'],
    'Training': [0.9945553539019963, 0.971415607985481, 0.9945553539019963],
    'Testing': [0.9470899470899471, 0.9417989417989417, 0.9238095238095239]
}
df_scores = pd.DataFrame(model_scores)
df_melted = df_scores.melt(id_vars='Model', var_name='Dataset', value_name='Accuracy')
fig = px.bar(df_melted, x='Model', y='Accuracy', color='Dataset',
barmode='group', labels={'Accuracy': 'Accuracy Score', 'Dataset': 'Accuracy'},
title='Model Comparison: Training and Testing Accuracy')
fig.show()
Based on the results, the XGBClassifier performed best among the three models: its training and testing accuracies are close, so the model is not overfitted, whereas the RandomForestClassifier and DecisionTreeClassifier show a larger gap between their training and testing scores. XGBoost's built-in regularization helps it reduce overfitting while still producing accurate predictions.
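The overfitting argument can be made concrete by computing each model's train-test gap (accuracies rounded from the printouts earlier in the notebook):

```python
# Train/test accuracies as printed earlier in the notebook
scores = {
    'RandomForest': (0.9946, 0.9471),
    'XGBoost':      (0.9714, 0.9418),
    'DecisionTree': (0.9946, 0.9238),
}
gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
# XGBoost has the smallest train-test gap, i.e. the least overfitting
```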
Save the model¶
# Saving the XGBoost classifier
joblib.dump(xgb_model, 'xgb_model.pkl')
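With the vectorizer, scaler, and model saved, scoring a new review means replaying the same preprocessing at inference time. A sketch — the helper names here are illustrative, and the artifacts would be loaded from the files saved above ('countvectorizer.pkl', 'scaler.pkl', 'xgb_model.pkl'):

```python
import re

def clean_review(text, stopword_set, stem):
    """Mirror the training-time preprocessing: letters only, lowercase,
    stopword removal, stemming."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stem(w) for w in words if w not in stopword_set)

def predict_sentiment(text, cv, scaler, model, stopword_set, stem):
    # cv / scaler / model are the joblib-loaded artifacts from this notebook
    X = scaler.transform(cv.transform([clean_review(text, stopword_set, stem)]).toarray())
    return int(model.predict(X)[0])   # 1 = positive, 0 = negative

# The cleaning step alone, with a trivial identity "stemmer" for illustration:
print(clean_review("Love my Echo!!", {'my'}, lambda w: w))  # love echo
```

In deployment you would call `predict_sentiment(text, joblib.load('countvectorizer.pkl'), joblib.load('scaler.pkl'), joblib.load('xgb_model.pkl'), STOPWORDS, PorterStemmer().stem)`.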